The overall objective of this project will be to analyze burglary data for Chicago, IL from 2015 to 2019.
Throughout this tutorial, we will attempt to find when and where burglaries are most likely to take place, while also complementing our analysis with interesting burglary trends and statistics.
This project is written in Python 3.9.1.
You will need the following libraries:
!pip install folium sodapy # install folium (maps) and sodapy (Socrata API client)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import folium
from sodapy import Socrata
from folium.plugins import HeatMap
This is the first stage of the data lifecycle. Here, we will collect all the data that we will need for our project.
The main dataset that we will be using contains all reported crimes in the city of Chicago since 2001 and can be found in the official Chicago Data Portal.
The data is published as a large CSV file, which we will query through the Socrata Open Data API using the sodapy client.
From this file we will only extract crime data for the years 2015-2019.
# These can be found in the data portal
domain = 'data.cityofchicago.org'
dataset_id = 'ijzp-q8t2'
# Generate token by creating an account for the data portal
token = 'Lkysyak9elTtcNXRVmfsj9YLX'
client = Socrata(domain, token)
# Get data for 2015-2019
results = client.get(dataset_id, where="date >= '2015-01-01' and date < '2020-01-01'", limit=2000000)
# Store into pandas dataframe
crime_table = pd.DataFrame.from_dict(results)
# Display first 5 rows of dataframe
crime_table.head()
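Requesting up to two million rows in a single call can be slow or may time out. The following is a minimal sketch of fetching rows in pages instead. The names `fetch_all` and `fetch_page` are hypothetical and not part of sodapy, and a fake page function stands in for the API so that the sketch runs offline.

```python
def fetch_all(fetch_page, page_size=50000):
    """Collect rows page by page until a page comes back short or empty.

    fetch_page is any callable taking (limit, offset) and returning a list of rows.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        rows.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += page_size
    return rows


# Stand-in for the API (made-up rows), served in slices like a paged endpoint.
fake_rows = [{"id": str(i)} for i in range(7)]


def fake_fetch_page(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_fetch_page, page_size=3)
print(len(all_rows))
```

With the real client, `fetch_page` could wrap `client.get(dataset_id, where=..., order='id', limit=limit, offset=offset)`. This assumes that `client.get` forwards the `order`, `limit`, and `offset` query parameters and that ordering by `id` keeps the pages stable.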
Additionally, we will be using some data downloaded from the FBI's Crime Data Explorer in CSV format. These burglary-specific datasets include statistics about victims' and offenders' age, sex, and race, as well as the relationship between victims and offenders and other crimes that burglary offenders have been charged with.
To import this data into dataframes, we will be using pandas' read_csv method.
# Burglary offenders by age
offender_age = pd.read_csv("https://annivas.github.io/files/offender-age-2015-2019.csv")
offender_age
# Burglary offenders by sex
offender_sex = pd.read_csv("https://annivas.github.io/files/offender-sex-2015-2019.csv")
offender_sex
# Burglary offenders by race
offender_race = pd.read_csv("https://annivas.github.io/files/offender-race-2015-2019.csv")
offender_race
# Burglary victims by age
victim_age = pd.read_csv("https://annivas.github.io/files/victim-age-2015-2019.csv")
victim_age
# Burglary victims by sex
victim_sex = pd.read_csv("https://annivas.github.io/files/victim-sex-2015-2019.csv")
victim_sex
# Burglary victims by race
victim_race = pd.read_csv("https://annivas.github.io/files/victim-race-2015-2019.csv")
victim_race
# Relationship between burglary offenders and victims
victim_offender_relationship = pd.read_csv("https://annivas.github.io/files/victim-offender-relationship-2015-2019.csv")
victim_offender_relationship
# Other offenses linked to burglary offenders
linked_offenses = pd.read_csv("https://annivas.github.io/files/linked-offenses-2015-2019.csv")
linked_offenses
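The eight files above follow the same naming pattern, so the URLs could also be generated instead of typed out. This is a small sketch of that idea; the `NAMES` list and the `csv_url` helper are illustrative names, and the loading step is left as a comment because it needs pandas and network access.

```python
BASE_URL = "https://annivas.github.io/files"

# File name stems, taken from the URLs used above.
NAMES = [
    "offender-age", "offender-sex", "offender-race",
    "victim-age", "victim-sex", "victim-race",
    "victim-offender-relationship", "linked-offenses",
]


def csv_url(name):
    """Build the download URL for one of the 2015-2019 burglary files."""
    return f"{BASE_URL}/{name}-2015-2019.csv"


# Map a variable-style key (offender_age, ...) to its URL.
urls = {name.replace("-", "_"): csv_url(name) for name in NAMES}

# Loading step (requires pandas and network access):
# tables = {key: pd.read_csv(url) for key, url in urls.items()}
print(urls["offender_age"])
```

Keeping the tables in a dictionary makes it easy to apply the same processing step to all of them later.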
Now that we have collected all the necessary data, it's time to process it and organize it in a way that will serve our needs for the remainder of the project.
First, let's extract all burglaries from the crime table into a new table. We will only choose the columns we need, as the initial crime table is filled with unnecessary information.
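# Get burglaries from crime table (only the columns we need).
# .copy() gives us an independent table, so adding columns later does not raise a SettingWithCopyWarning.
columns = ['id', 'primary_type', 'description', 'arrest', 'location', 'latitude', 'longitude', 'date', 'year']
burglary_table = crime_table.loc[crime_table['primary_type'] == 'BURGLARY', columns].copy()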
# Display first 5 rows
burglary_table.head()
Now that we have only the data we need, let's add some new columns derived from the "date" column:
# Create month column. Months are represented as ints from 1 (January) to 12 (December).
# We could represent months as strings, but integers facilitate plotting.
burglary_table['month'] = pd.DatetimeIndex(burglary_table['date']).month
# Display first 5 rows
burglary_table.head()
# Create day column. Days are represented as ints from 0 (Monday) to 6 (Sunday).
# We could represent days as strings, but integers facilitate plotting.
burglary_table['day'] = pd.DatetimeIndex(burglary_table['date']).weekday
# Display first 5 rows
burglary_table.head()
# Create time column. Time is expressed in hours and hours are represented as ints from 0 (12 am) to 23 (11 pm)
burglary_table['time'] = pd.DatetimeIndex(burglary_table['date']).hour
# Display first 5 rows
burglary_table.head()
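To make the meaning of the three derived columns concrete, here is a small sketch on two made-up timestamps. It parses the date strings once with `pd.to_datetime` and then uses the `.dt` accessor, which is an alternative to wrapping the column in `pd.DatetimeIndex` three times. The `toy` table is illustrative and not part of the project data.

```python
import pandas as pd

# Two made-up timestamps in ISO format with milliseconds.
toy = pd.DataFrame({"date": ["2019-08-05T09:30:00.000", "2016-02-14T23:05:00.000"]})

# Parse the strings once, then derive all three columns from the parsed values.
parsed = pd.to_datetime(toy["date"])
toy["month"] = parsed.dt.month    # 1 (January) to 12 (December)
toy["day"] = parsed.dt.weekday    # 0 (Monday) to 6 (Sunday)
toy["time"] = parsed.dt.hour      # 0 (12 am) to 23 (11 pm)

print(toy)
```

The first row is a Monday morning in August and the second is a Sunday night in February, so the derived values can be checked by hand against a calendar.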
For the complementary data we imported, the only processing that needs to be done is setting the "Key" column as the index of each table and sorting the tables by "Value" to facilitate plotting.
offender_age = offender_age.set_index('Key').sort_values(by="Value", ascending=False)
offender_sex = offender_sex.set_index('Key').sort_values(by="Value", ascending=False)
offender_race = offender_race.set_index('Key').sort_values(by="Value", ascending=False)
victim_age = victim_age.set_index('Key').sort_values(by="Value", ascending=False)
victim_sex = victim_sex.set_index('Key').sort_values(by="Value", ascending=False)
victim_race = victim_race.set_index('Key').sort_values(by="Value", ascending=False)
victim_offender_relationship = victim_offender_relationship.set_index('Key').sort_values(by="Value", ascending=False)
linked_offenses = linked_offenses.set_index('Key').sort_values(by="Value", ascending=False)
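Since every complementary table gets the same treatment, the step can be wrapped in a small helper. The sketch below uses a made-up table with the same "Key" and "Value" columns; `prepare` is a hypothetical name.

```python
import pandas as pd


def prepare(table):
    """Index a table by its "Key" column and sort it by "Value", largest first."""
    return table.set_index("Key").sort_values(by="Value", ascending=False)


# Made-up numbers, used only to show the effect of the helper.
toy = pd.DataFrame({"Key": ["Female", "Male", "Unknown"], "Value": [10, 50, 5]})
prepared = prepare(toy)
print(prepared)
```

The helper returns a new table and leaves its input untouched, so it can be applied to each of the eight tables in turn.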
Now that our data is clean and organized, it's time to analyze it through the use of visualizations. This is usually the most interesting part of the data lifecycle, as we will attempt to plot our data and observe possible trends.
First, we will use the original crime table to measure the occurrences of each type of crime in the last 5 years.
# Calculate number of each crime type occurrence in crime_table
crime_type_occ = crime_table['primary_type'].value_counts()
crime_type_occ
From the above data, theft looks to be the most common crime in Chicago, while burglary is 8th.
Now, let's plot the 12 most common types of crime in a pie chart to get a better idea.
crime_type_occ[0:12].plot(kind='pie', figsize=(10, 10), title="Types of Crime", autopct='%1.1f%%')
plt.ylabel("")
plt.show()
By using the burglary_table, we can plot the number of burglaries by year and hopefully observe a trend.
burglary_table['year'].value_counts().sort_index().plot(kind='bar', rot=0, title="Burglaries by Year", figsize=(10, 8))
plt.ylabel("Number of Burglaries")
plt.show()
From the above bar plot, we can tell that 2016 had the most burglaries of the five years. More importantly, the number of burglaries has decreased every year since 2016.
Now let's try to visualize burglaries by month. By counting the occurrences of each month in our burglary table and dividing by the number of years, we can get the average number of burglaries per month over the five years.
# Total number of burglaries for 5 years, grouped by month
burglaries_by_month = burglary_table['month'].value_counts().sort_index()
# Divide value for each month by 5 to get the average number of burglaries per month in a single year
burglaries_by_month = burglaries_by_month / 5
burglaries_by_month
figure(num=None, figsize=(14, 8))
x = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
y = burglaries_by_month
plt.bar(x,y)
plt.title("Burglaries by Month")
plt.ylabel("Number of Burglaries")
plt.show()
After plotting the average number of burglaries per month, we can start observing some trends. Most burglaries occur in August (a little less than 1200), followed by July, which could be attributed to the fact that many homes are left unoccupied during summer vacations. February seems to have the lowest average number of burglaries (about 2/3 of August's burglaries), which suggests that households are least at risk during that month of the year.
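Part of the gap between months may simply come from their different lengths, since February has fewer days than August. As a rough sketch, the number of days each month contributes over 2015-2019 can be computed with the standard library and used to turn a monthly total into a per-day rate. The helper names are hypothetical.

```python
import calendar

YEARS = range(2015, 2020)


def days_in_month_over_period(month, years=YEARS):
    """Total number of days a calendar month contributes across the given years."""
    return sum(calendar.monthrange(year, month)[1] for year in years)


# Days per calendar month over the five years, keyed by month number (1-12).
days = {month: days_in_month_over_period(month) for month in range(1, 13)}


def per_day_rate(total_count, month):
    """Average number of events per day for a month, given its five-year total."""
    return total_count / days[month]


print(days[2], days[8])
```

Dividing the five-year totals from `burglary_table['month'].value_counts().sort_index()` by these day counts would give burglaries per day for each month, which makes months of different length directly comparable.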
We can also plot the number of burglaries by day of the week.
# Total number of burglaries for 5 years, grouped by day of week
burglaries_by_day = burglary_table['day'].value_counts().sort_index()
# Divide value for each day by 5 to get number of burglaries for each year by day
burglaries_by_day = burglaries_by_day / 5
# Divide value for each day by 52.1429 (number of weeks in a year) to get the average number of burglaries on each day of the week
burglaries_by_day = burglaries_by_day / 52.1429
burglaries_by_day
figure(num=None, figsize=(14, 8))
x = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
y = burglaries_by_day
plt.bar(x,y)
plt.title("Burglaries by Day of the Week")
plt.ylabel("Number of Burglaries")
plt.show()
There seem to be about 35 burglaries per weekday in Chicago, while the number is lower on weekends. This could be attributed to the fact that most people are at work on weekdays, and empty houses make better targets for burglars. Weekends seem to be less suitable days for burglaries, as most families stay at home.
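Dividing by 5 and then by 52.1429 assumes that every day of the week occurs equally often in the period. As a quick check, the exact number of Mondays, Tuesdays, and so on between 2015-01-01 and 2019-12-31 can be counted with the standard library.

```python
from collections import Counter
from datetime import date, timedelta

start = date(2015, 1, 1)
end = date(2020, 1, 1)  # exclusive upper bound, matching the query used earlier

# Count how often each weekday (0 = Monday ... 6 = Sunday) occurs in the period.
weekday_counts = Counter(
    (start + timedelta(days=offset)).weekday()
    for offset in range((end - start).days)
)

for weekday in range(7):
    print(weekday, weekday_counts[weekday])
```

Each weekday occurs 260 or 261 times, against 5 × 52.1429 ≈ 260.7 in the approximation, so the shortcut changes the averages only marginally.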
Now let's dive a step deeper, and plot the number of burglaries by time of the day.
# Total number of burglaries for 5 years, grouped by hour in the day
burglaries_by_time = burglary_table['time'].value_counts().sort_index()
# Divide value for each hour by 5 to get number of burglaries for each year per hour
burglaries_by_time = burglaries_by_time / 5
# Divide value for each hour by 365 (each hour of the day occurs once per day, so 365 times a year) to get the average number of burglaries per day in that hour
burglaries_by_time = burglaries_by_time / 365
burglaries_by_time
burglaries_by_time.plot(kind='bar', rot=0, title="Burglaries by Time of the Day", figsize=(10, 8))
plt.ylabel("Number of Burglaries")
plt.show()
It might be expected that most burglaries occur at night. However, according to the above bar plot, most burglaries in Chicago occur around 8 am, 9 am, and 12 pm. In fact, almost one burglary occurs at these times every day. Burglaries are least likely to occur from 1 am to 6 am. A possible reason for this trend could be the same as above: the number of burglaries increases at the times when most people leave home for work, which suggests that burglars prefer empty homes.
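A statement such as "one burglary per day at a given hour" can be sanity-checked with simple arithmetic: each hour of the day occurs exactly once per calendar day, so the total count for an hour slot divided by the number of days in the period gives the average per day. The sketch below uses made-up counts, and `hourly_rate_per_day` is a hypothetical helper.

```python
from datetime import date

PERIOD_START = date(2015, 1, 1)
PERIOD_END = date(2020, 1, 1)  # exclusive

# Number of calendar days in the period (includes the leap day in 2016).
n_days = (PERIOD_END - PERIOD_START).days


def hourly_rate_per_day(count_in_hour_slot, days=n_days):
    """Average number of events per day within one hour-of-day slot."""
    return count_in_hour_slot / days


# Made-up total for one hour slot over the five years.
example_rate = hourly_rate_per_day(913)
print(n_days, example_rate)
```

Under this reading, an hour slot needs about 1,826 burglaries over the five years to average one per day.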
These observations seem interesting. Let's also visualize them in a line plot.
burglaries_by_time.plot(rot=0, title="Burglaries by Time of the Day", figsize=(10, 8))
plt.ylabel("Number of Burglaries")
plt.show()
This line plot confirms the observations from our bar plot and shows the big difference in burglary occurrence between different times of the day.
Now that we have determined when burglaries are most likely to occur, let's observe where they are most likely to occur.
We will do this by creating an interactive heat map indicating the areas of Chicago with the highest concentration of burglaries.
The burglary table contains a very large number of data points, which would make our heat map cluttered and hard to read. To keep the map readable, we will use a random sample of 10,000 rows.
# Take random sample of 10,000 rows
sample_table = burglary_table.sample(n=10000)
# Display first 5 rows
sample_table.head()
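Because the sample is random, the map will look slightly different on every run. If reproducible output is wanted, `DataFrame.sample` accepts a `random_state` seed. The sketch below shows this on a made-up table; the seed value 42 is arbitrary.

```python
import pandas as pd

# Made-up table standing in for burglary_table.
toy = pd.DataFrame({"id": range(100)})

# The same seed yields the same rows on every call.
first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)

print(first.equals(second))
```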
To map our sample, we will be using the folium package.
# Create map
map_osm = folium.Map(location=[41.88, -87.63], zoom_start=11)
# Drop rows where the coordinates are missing
heat_table = sample_table.dropna(subset=['latitude', 'longitude'])
# Get heat data from sample as [latitude, longitude] pairs of floats
heat_data = heat_table[['latitude', 'longitude']].astype(float).values.tolist()
# Create heat map
HeatMap(heat_data, radius=20).add_to(map_osm)
map_osm
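The heat map expects a list of numeric [latitude, longitude] pairs. Coordinates loaded from a JSON API may arrive as strings or be missing, so a defensive conversion can help. This is a sketch on a made-up table, assuming the coordinates might be strings; it is not necessarily how the data above is typed.

```python
import pandas as pd

# Made-up rows: one complete pair, one missing latitude, one missing longitude.
toy = pd.DataFrame({
    "latitude": ["41.88", None, "41.79"],
    "longitude": ["-87.63", "-87.70", None],
})

# Convert to numbers (unparseable values become NaN), then drop incomplete pairs.
coords = toy[["latitude", "longitude"]].apply(pd.to_numeric, errors="coerce").dropna()

# List of [latitude, longitude] pairs, the shape used for heat_data above.
heat_pairs = coords.values.tolist()
print(heat_pairs)
```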
Now let's make our map a bit more descriptive, by adding some more data.
We will be creating circles, indicating the location of each burglary. By clicking on the circles, one will be able to see the incident description. Additionally, green circles will indicate burglaries where the offender has been arrested, while black circles will mean that the offender has not been arrested.
# Add circles
for index, row in heat_table.iterrows():
    # Green if the offender was arrested, black otherwise
    if row['arrest'] == True:
        color = 'green'
    else:
        color = 'black'
    folium.Circle(
        radius=20,
        location=[row['latitude'], row['longitude']],
        popup=row['description'],
        color=color,
        fill=True,
    ).add_to(map_osm)
map_osm
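The color rule can also be isolated in a small function, which makes it easier to test and to adapt. The sketch below assumes that the arrest flag might arrive either as a boolean or as the strings "true"/"false", depending on how the data was loaded; `arrest_color` is a hypothetical helper.

```python
def arrest_color(arrest):
    """Return 'green' if the offender was arrested, 'black' otherwise."""
    if isinstance(arrest, str):
        arrested = arrest.strip().lower() == "true"
    else:
        # Comparing with True keeps missing values (None, NaN) in the "not arrested" group.
        arrested = bool(arrest == True)  # noqa: E712
    return "green" if arrested else "black"


for value in (True, False, "true", "FALSE", None):
    print(repr(value), arrest_color(value))
```

Inside the loop above, `color = arrest_color(row['arrest'])` would then replace the if/else block.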